February 27, 2019

Objectives

  • Substantive research interests
    • Broader question: Emergence of AfD as party and parliamentary presence - what are the effects on party competition and parliamentarism?
    • Descriptive (preliminary) question: What are the prevalent framings in speeches given by AfD parliamentarians?
    • Contagion hypothesis (diffusion): (Speakers of) other parliamentary groups may take over framings offered by AfD speakers.
    • cf. DFG project “The populist challenge in parliament” (2019-2021, in cooperation with Christian Stecker, Marcel Lewandowsky, Jochen Müller)
  • Methodological interests
    • Validity and intersubjectivity of data-driven, “distant reading” approaches (in the eHumanities)
    • ML/AI: Annotation to gain training data for statistical learning => gold standard annotation
    • Social sciences: Traditions of coding and annotating text data: Quantitative/qualitative content analysis

Focus of the presentation

  • Combining close and distant reading (Moretti 2013) is an unfulfilled promise: Software often inhibits combining both perspectives. How can workflows for coding and annotating textual data be implemented? The polmineR R package is presented as a potential solution.

  • Special focus: Interactive graph annotation as an approach to generate intersubjectively shared interpretations/understandings of discourse patterns.

  • Schedule:
    • Theory is code
    • The MigParl corpus
    • AfD Keywords
    • Graph annotation
    • Conclusions

Technical remarks

  • These slides are an ioslides presentation created using R Markdown. The following single character keyboard shortcuts are active and enable alternate display modes:
    • ‘f’: enable fullscreen mode (note that slides are optimized for fullscreen mode)
    • ‘w’: toggle widescreen mode (not recommended)
    • ‘o’: enable overview mode
  • Appearance may differ slightly between browsers (Firefox / Safari / Chrome).

  • The code for generating the slides is available at a GitHub repository, as we try to follow the ideal of reproducible research.

  • The GitHub repo has a DOI (from zenodo): 10.5281/zenodo.2949021

Theory is code

Combining R and CWB

A design for close and distant reading

  • Why R?
    • among the most common programming languages in the social sciences
    • comprehensive availability of statistical methods
    • great visualisation capabilities
    • usability: RStudio as IDE
    • reproducible research: R Markdown notebooks
  • Why the Corpus Workbench (CWB)?
    • a classic toolset for corpus analysis
    • indexing and compression of corpora => performance
    • powerful and versatile syntax of the Corpus Query Processor (CQP)
    • permissive license (GPL)
  • NoSQL / Lucene / Elasticsearch are potential alternatives - but not for now

The PolMine Project R Packages

The core family of packages:

  • polmineR: basic vocabulary for corpus analysis

  • RcppCWB: wrapper for the Corpus Workbench (using C++/Rcpp, follow-up on rcqp-package)

  • cwbtools: tools to create and manage CWB indexed corpora

And there are a few other packages:

  • GermaParl: documents and disseminates GermaParl corpus
  • frappp: framework for parsing plenary protocols
  • annolite: light-weight full text display and annotation tool
  • topicanalysis: integrate quantitative/qualitative approaches to topic models
  • gradget: graph annotation widget

polmineR: Objectives

  • performance: if analysis is slow, interaction with the data will suffer

  • portability: painless installation on all major platforms

  • open source: no restrictions and inhibiting licenses

  • usability: make full use of the RStudio IDE

  • documentation: transparency of the methods implemented

  • theory is code: combine quantitative and qualitative methods

Getting started

  • Getting started with polmineR is easy: Assuming that R and RStudio are installed, polmineR can be installed as simply as follows (dependencies such as RcppCWB will be installed automatically). Enter in an R session:
install.packages("polmineR")
  • Get the GermaParl corpus, a corpus of plenary debates in the German Bundestag (Blätte and Blessing 2018).
drat::addRepo("polmine") # add CRAN-style repository to known repos
install.packages("GermaParl") # the downloaded package includes a small sample dataset
GermaParl::germaparl_download_corpus() # get the full corpus
  • That’s it. Ready to go.
library(polmineR)
use("GermaParl") # activate the corpora in the GermaParl package, i.e. GERMAPARL

polmineR - the basic vocabulary

One of the ideas of the polmineR package is to offer a basic vocabulary to implement common analytical tasks:

  • create subcorpora: partition(), subset()

  • counting: hits(), count(), dispersion() (see size())

  • create term-document-matrices: as.TermDocumentMatrix()

  • get keywords / feature extraction: features()

  • compute cooccurrences: cooccurrences(), Cooccurrences()

  • inspect concordances: kwic()

  • recover full text: get_token_stream(), html(), read()
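
As a minimal illustration of this vocabulary, the following sketch uses the small REUTERS sample corpus that ships with polmineR (activated via use("polmineR")); the query term is just an example:

```r
library(polmineR)
use("polmineR") # activate the sample corpora included in the package, e.g. REUTERS

size("REUTERS")                # total size of the corpus in tokens
count("REUTERS", query = "oil") # frequency of a single query term
kwic("REUTERS", query = "oil", left = 5, right = 5) # concordances for the query
```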

Metadata and partitions/subcorpora

  • This is the “good old” workflow to create partitions (i.e. subcorpora):
p <- partition("GERMAPARL", year = 2001)
m <- partition("GERMAPARL", speaker = "Merkel", regex = TRUE)
  • And there is an emerging new workflow …
am <- corpus("GERMAPARL") %>% subset(speaker == "Angela Merkel")

m <- corpus("GERMAPARL") %>% subset(grep("Merkel", speaker)) # beware: also matches Petra Merkel!

cdu_csu <- corpus("GERMAPARL") %>%
  subset(party %in% c("CDU", "CSU")) %>%
  subset(role != "presidency")
  • You might read the code aloud as follows: “We generate a subcorpus X by taking the corpus GERMAPARL, subsetting it based on criterion Y, …”
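
To check what such a pipeline returns, the resulting subcorpus can be inspected with size() and s_attributes(). A hypothetical follow-up, assuming the full GERMAPARL corpus is installed:

```r
library(polmineR)
use("GermaParl")

# build a subcorpus and inspect it
cdu_csu <- corpus("GERMAPARL") %>%
  subset(party %in% c("CDU", "CSU")) %>%
  subset(role != "presidency")

size(cdu_csu)                  # number of tokens in the subcorpus
s_attributes(cdu_csu, "year")  # years covered by the subcorpus
```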

Counting and dispersions

dt <- dispersion("GERMAPARL", query = "Flüchtlinge", s_attribute = "year")
barplot(height = dt$count, names.arg = dt$year, las = 2, ylab = "Häufigkeit")

Concordances / KWIC output

q <- '[pos = "NN"] "mit" "Migrationshintergrund"'
corpus("GERMAPARL") %>% kwic(query = q, cqp = TRUE, left = 10, right = 10)

Validating sentiment analysis

# good / bad: character vectors of positive / negative SentiWS terms (defined earlier)
kwic("GERMAPARL", query = "Islam", positivelist = c(good, bad)) %>%
  highlight(lightgreen = good, orange = bad) %>%
  tooltips(setNames(SentiWS[["word"]], SentiWS[["weight"]])) %>%
  knit_print()

Full text output

  • This is how you can recover the fulltext of a subcorpus.
corpus("GERMAPARL") %>% # take the GERMAPARL corpus
  subset(date == "2009-11-10") %>% # create a subcorpus based on a date
  subset(speaker == "Merkel") %>% # get me the speech given by Merkel
  html(height = "250px") %>% # turn it into html
  highlight(list(yellow = c("Bundestag", "Regierung"))) # and highlight words of interest
  • Inspecting the fulltext can be extremely useful to evaluate topic models: This is how you would highlight the most likely terms of a topic model using polmineR:
# BE_lda: a fitted LDA topic model; ek: a partition object (both created earlier)
h <- get_highlight_list(BE_lda, partition_obj = ek, no_token = 150)
h <- lapply(h, function(x) x[1:8])

corpus("BE") %>%
  subset(date == "2005-04-28") %>%
  subset(grepl("Körting", speaker)) %>% 
  as.speeches(s_attribute_name = "speaker", verbose = FALSE)[[4]] %>% 
  html(height = "350px") %>%
  highlight(highlight = h)

Data

The MigParl Corpus

  • The following analysis is based on the MigParl corpus.

  • The corpus has been prepared in the MigTex Project (“Textressourcen für die Migrations- und Integrationsforschung”, funding: BMFSFJ)

  • Preparation of all plenary debates in Germany’s regional parliaments (2000-2018) using the “Framework for Parsing Plenary Protocols” (frappp-package)

  • Extraction of a thematic subcorpus using unsupervised learning (topic modelling)

  • Size of the MigParl corpus: 27,241,205 tokens

  • size without interjections and presidency: 22,837,376 tokens

  • structural annotation: id | speaker | party | role | lp | session | date | regional_state | interjection | year | agenda_item | agenda_item_type | speech | topics | harmonized_topics

As announced initially, our analytical concern is speeches given by AfD parliamentarians.

MigParl by year

AfD in MigParl - tokens

AfD in MigParl - share

MigParl - regional dispersion

So what’s in the data?

  • There is an (unsurprising) peak of debates on migration and integration affairs in 2015.

  • The total number of words spoken by AfD parliamentarians and the relative share has increased, as the AfD made it into an increasing number of regional parliaments.

  • The AfD presence is stronger in the Eastern regional states, corresponding to stronger electoral results there.

AfD Keywords

Term extraction explained

  • To gain a first insight into the thematic foci and linguistic features of AfD speakers, we use the technique of term extraction (Baker 2006).

  • The fundamental idea is to identify terms that occur more often in a corpus of interest compared to a reference corpus than would be expected by chance. The statistical test used is a chi-squared test (Rose et al. 1998).

  • To exemplify the flexibility of polmineR, we move beyond the analysis of single words, and inspect 2- and 3-grams, considering particularly interesting sequences of part-of-speech-tags.

  • What we may learn from the following three tables is that assumed features of populist style remain present when the AfD arrived in parliament: Foreigners and asylum-seekers are an object of concern (using pejorative language), and we see vocabulary that indicates the critique of established parties and elites.
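
The chi-squared test mentioned above can be sketched in base R for a single candidate term; the counts below are purely hypothetical and serve only to illustrate the 2x2 contingency table underlying the keyness computation:

```r
# Hypothetical counts for one term:
#                      term   other tokens
# corpus of interest:  a      b
# reference corpus:    c      d
a <- 120     # occurrences of the term in the corpus of interest
b <- 50000   # remaining tokens in the corpus of interest
c <- 30      # occurrences of the term in the reference corpus
d <- 900000  # remaining tokens in the reference corpus

tab <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
chisq.test(tab, correct = FALSE)$statistic # chi-squared keyness score
```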

Term extraction I

Term extraction II (ADJA - NN)

Term extraction III (NN-ART-NN)

Graph Annotation

The elusive merit of cooccurrence graphs

  • Cooccurrence graphs are an eye-catcher and have become a popular analytical approach in the eHumanities (Scharloth, Eugster, and Bubenhofer 2013; Lemke and Wiedemann 2016).

  • The visualisations are very suggestive and seem to be a great condensation of ideas we have about discourse.

  • But are these interpretations sound and do they meet standards of intersubjectivity?

  • To start with, I will point out that there are many choices behind these visualisations that can be contested.

  • The solution I suggest is to work with three-dimensional, interactive graph visualisations that can be annotated (called gradgets, for graph annotation widgets).

polmineR & cooccurrences

  • The polmineR package offers the functionality to get the cooccurrences for a specific query of interest. The default method for calculating cooccurrences is the log-likelihood test.

  • The cooccurrences()-method can be applied to subcorpora / partitions, and corpora.

cooccurrences("GERMAPARL", query = 'Islam', left = 10, right = 10)
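
The log-likelihood statistic that is the default here can be sketched as follows, using a common likelihood-ratio formulation; all counts are hypothetical, and the exact implementation in polmineR may differ in detail:

```r
# Log-likelihood ratio for one cooccurring token (sketch)
ll_score <- function(k, n, K, N) {
  # k: cooccurrences of the token within the query windows
  # n: total tokens within the query windows
  # K: overall frequency of the token in the corpus
  # N: overall corpus size
  l <- function(k, n, p) k * log(p) + (n - k) * log(1 - p)
  p0 <- K / N              # null hypothesis: same rate inside and outside the windows
  p1 <- k / n              # observed rate inside the windows
  p2 <- (K - k) / (N - n)  # observed rate outside the windows
  2 * (l(k, n, p1) + l(K - k, N - n, p2) - l(k, n, p0) - l(K - k, N - n, p0))
}

ll_score(k = 25, n = 2000, K = 300, N = 100000) # compare against 10.83 (p = 0.001)
```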

Getting all cooccurrences

  • Starting with polmineR v0.7.9.11, the package includes a method to efficiently calculate all cooccurrences in a corpus. Doing this for the GERMAPARL corpus is as simple as follows.
m <- partition("GERMAPARL", year = 2008, speaker = "Angela Merkel", interjection = FALSE)
drop <- terms(m, p_attribute = "word") %>% noise() %>% unlist()
Cooccurrences(m, p_attribute = "word", left = 5L, right = 5L, stoplist = drop) %>% 
  decode() %>%
  ll() %>%
  subset(ll >= 10.83) %>% # critical value of the chi-squared distribution (df = 1, p = 0.001)
  subset(ab_count >= 5) -> coocs
  • Our objective is to obtain the significant cooccurrences of the AfD in parliamentary discourse: We are not just interested in all statistically significant cooccurrences, but more specifically in those that distinguish AfD speech-making from speeches made by parliamentarians of other factions.

  • Accordingly, we get the relevant AfD cooccurrences by way of a difference test (chi-squared statistic) with cooccurrences in speeches by all other parliamentarians. See the code for these slides to learn how this is implemented in polmineR.

  • An analogous approach to get significant cooccurrences is implemented in the CorporaCoCo R package (Hennessey et al. 2017), see also this research note on co-occurrence comparison techniques.

AfD Cooccurrences

Graph visualisation (2D, N = 100)

Graph visualisation (2D, N = 250)

Graph visualisation (2D, N = 400)

Where we stand

  • The graph layout depends heavily on filter decisions.

  • Filtering is necessary, but filter decisions are difficult to justify.

  • Graph visualisation implies many possibilities to provide extra information, but there are perils of information overload.

  • If we try to omit filter decisions, we run into the problem of overwhelming complexity of large graphs.

  • How to handle the complexity and create the foundations for intersubjectivity?

Graph visualisation (3D)

So ‘gradgets’ are the solution suggested here. The links to the following three gradgets offer a visualisation that is interactive in a double sense:

  1. You can turn the visualisation in three-dimensional space
  2. You can click on the edges and nodes, get the concordances that are behind the statistical evaluation, and leave an annotation.

In a real-world workflow, the result of the graph annotation exercise can be stored and put into an online appendix to a publication that explains interpretative results.

So these are the gradgets:

Conclusions

Conclusions

The results of this research are very preliminary:

  • There is a (somewhat surprising) explicit politeness of AfD speakers.

  • It’s the economy: Introducing a redistributive logic as a leitmotiv.

  • There is no discursive self-isolation at all, but a lot of interaction with other parties (and visitors!).

  • Cultivating antagonisms: “Wir” (AfD / AfD-Fraktion) and the others.

But in a way, AfD speeches served only as a case for how we might develop the idea of “visual hermeneutics” (Schaal, Kath, and Dumm 2016): If we decide to work with cooccurrence graphs, graph annotation is the approach suggested here to realise the idea of distant and close reading, and to achieve intersubjectivity.

References

Baker, Paul. 2006. Using Corpora in Discourse Analysis. London: Continuum.

Blätte, Andreas, and Andre Blessing. 2018. “The GermaParl Corpus of Parliamentary Protocols.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA).

Hennessey, Anthony, Viola Wiegand, Michaela Mahlberg, Christopher R. Tench, and Jamie Lentin. 2017. CorporaCoCo: Corpora Co-Occurrence Comparison. https://CRAN.R-project.org/package=CorporaCoCo.

Lemke, Matthias, and Gregor Wiedemann. 2016. Text Mining in Den Sozialwissenschaften: Grundlagen Und Anwendungen Zwischen Qualitativer Und Quantitativer Diskursanalyse. Wiesbaden: Springer.

Moretti, Franco. 2013. Distant Reading. London: Verso.

Rose, Tony, and Adam Kilgarriff. 1998. “Measures for Corpus Similarity and Homogeneity.” In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, 46–52. ACL-SIGDAT.

Schaal, G.S., R. Kath, and S. Dumm. 2016. “New Visual Hermeneutics.” Cybernetics & Human Knowing 23 (2): 51–75.

Scharloth, Joachim, David Eugster, and Noah Bubenhofer. 2013. “Das Wuchern Der Rhizome. Linguistische Diskursanalyse Und Data-Driven Turn.” In Linguistische Diskursanalyse. Neue Perspektiven, edited by Dietrich Busse and Wolfgang Teubert, 345–80. Wiesbaden: VS Verlag.